ABSTRACT


The FAIR principles describe characteristics intended to support access to and reuse of digital artifacts in 
the scientific research ecosystem. Persistent, globally unique identifiers, resolvable on the Web, and 
associated with a set of additional descriptive metadata, are foundational to FAIR data. Here we describe 
some basic principles and exemplars for their design, use and orchestration with other system elements to 
achieve FAIRness for digital research objects. 
1. INTRODUCTION


The FAIR principles [1] are a model to guide the implementation of interoperable digital research artifacts 
which are fully discoverable and ultimately reusable in the scientific research ecosystem. These digital 
artifacts must be preserved long term as the basis of modern scientific research. However, recent publications 
suggest that up to 80% of scientific research data are lost within 20 years [2]. This estimate does not take 
into account the effects of practical unavailability of such data, when they have not been robustly and 
accessibly archived, even when they are not actually “lost”. Data attrition blocks the ability to validate 
results and reuse information in science. 


“FAIR” as an approach to reduce data attrition and support reuse has seen growing support in EU- and
US-funded projects① and influential science policy recommendations [3].


This article focuses specifically on identifiers as required for practical implementation of the foundational 
“F” element of FAIR, the “Findability” of digital research artifacts. All the other FAIR principles (“A”, “I”, 
and “R”) are based upon it. We cannot access, interoperate or reuse data without knowing how to identify 
them. And ultimately, all criteria of FAIR influence how we must provision identifiers. 


FAIR data, software, and other digital research objects collectively support the validity of scientific 
research as communicated in the literature. Likewise, the scientific literature in turn supports FAIR data, 
when properly referenced, by describing the methods by which the data have been obtained. 


2. IDENTIFIERS RESOLVABLE ON THE INTERNET 


To be “Findable”, digital objects must be uniquely specifiable, i.e., they must have unique names, and 
must be locatable, by association of the unique name with a protocol for retrieval. They must also have at 
least some associated descriptive metadata to assist discovery and verification of that object when the 
identifier is not known a priori. Recent work on scientific data “commons” architectures [4] and multi- 
commons “data ecosystem frameworks” [5], highlighted here, reinforces the FAIR Accessibility principle 
that identifiers must be paired with such descriptive metadata [6,7,8]. 


An identifier is a unique name given to an object, property, set or class. Digital identifiers, by which we 
mean ordered strings of characters, uniquely name digital resources. An identifier may sometimes be “read”, 
as if its string revealed meaningful semantics, but fundamentally, the underlying semantics of an identifier 
is that it means the resource to which it refers. Table 1 shows four common FAIR identifier types, which 
we discuss in detail later in this article. 


① See https://gdc.cancer.gov, https://www.fairsfair.edu, and https://fairplus-project.edu.
Table 1. Examples of FAIR digital identifiers.


Identifier Type   Example                                                           Object Identified

DOI               https://doi.org/10.1038/sdata.2016.18                             A digital object: the publication Wilkinson et al. 2016, Sci Data 3, 160018
ARK               http://n2t.net/ark:/65665/3c2e39526-e0c3-41ae-be4f-07558a9458eb   A physical specimen in the Smithsonian collection: holotype of the insect Z. wardiana
Identifiers.org   https://identifiers.org/uniprot:P0DP23                            A digital object: information on the protein Calmodulin-1
PURL              http://purl.org/dc/terms/publisher                                An abstract term in the DCTERMS vocabulary: “publisher”


Identifiers have a scope within which they are verifiably unique. For digital artifacts this may be local 
scope within some particular data collection, as is the case with typical “accession numbers”, the concept 
for which originated in museum, library and specimen collection practice and was extended to the 
biomedical sequence and literature databases. 


In current best citation practice, identifiers must be globally unique and resolvable on the Internet, where
digital identifiers are ordinary Web addresses, technically Uniform Resource Identifiers (URIs) [9].
A URI is a compact sequence of characters uniquely identifying a resource. Resources may be physical
things (e.g., a dog, building, microscope, star, or person, which are described by associated information),
digital (e.g., a document, data set, catalog record, program executable, service endpoint, or the actual
objects of interest), or abstract (e.g., a vocabulary term).


URIs begin with a scheme name which specifies the method used to resolve them, e.g., http, urn, ftp, or
ldap. In this article we will be speaking primarily of secure http URIs, which begin with “https://”
[10], but we use the term http URI to mean both secure and insecure variants. We will discuss their
resolution in the context of REST interfaces [11,12], the canonical way to access required metadata,
including the resolution endpoints, for any persistent identifier.


An http URI gives a protocol for finding (resolving) and accessing the information resource to which it
refers, that is, it is used to locate the resource. The “Accessible” requirement of FAIR is therefore
also embedded in http URI-based identifiers.
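
As a small illustration of the scheme-first structure described above, the following Python sketch splits two of the example URIs from Table 1 into scheme, host and path using only the standard library. The helper name describe_uri is ours, introduced purely for illustration; it is not part of any identifier service.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the parts discussed above: scheme, host and path."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


# Example identifiers taken from Table 1.
for uri in (
    "https://doi.org/10.1038/sdata.2016.18",
    "http://purl.org/dc/terms/publisher",
):
    info = describe_uri(uri)
    # Both "http" and "https" count as http URIs in the sense used in this article.
    is_http_uri = info["scheme"] in ("http", "https")
    print(info["scheme"], info["host"], info["path"], is_http_uri)
```

The scheme tells a client which protocol to speak, the host tells it which resolver to contact, and the path carries the name that the resolver looks up.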


3. IDENTIFIER PERSISTENCE AND INDIRECTION 


The hosting of identified resources can change: publishing companies or data archives may merge or
be acquired, or the internal infrastructure at an institution may be revamped, changing a resource’s
URIs. Identifiers that depend upon local hostnames break when this happens; the problem
can be mitigated by indirection through central public resolvers. These provide resolution for the identifier
URI by redirecting http GET requests [12] to a different http URI hosted by the information provider.
Identifiers may be made indirect by (Pattern 1) registering each one with a redirecting service (along with
some metadata) as it is issued; or (Pattern 2) by registering a unique namespace, identifier pattern and 
resolver endpoint for the resource provider just once. In either case indirect identifiers become “persistable” 
in that changes to the endpoint URIs can be made invisible and non-disruptive to users either individually 
(Pattern 1 above) or in bulk (Pattern 2) when the provider hosting structure changes. Pattern 1 allows us 
to retrieve core metadata elements for individual identifiers. Pattern 2 allows us to retrofit global resolution 
and persistence behavior to systems that formerly assigned only locally unique identifiers such as database 
accession numbers. 
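
The difference between the two patterns can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the implementation of any real resolver: the registries, the namespace "exdb", the identifier pattern and all example.org URIs are hypothetical.

```python
import re

# Pattern 1: every identifier is registered individually, together with its
# own redirect target and some metadata. All entries here are made up.
PER_IDENTIFIER = {
    "10.9999/example.1": {
        "target": "https://repo.example.org/datasets/1",
        "title": "Example data set",
    },
}

# Pattern 2: one rule per provider, registered once: a namespace, a pattern
# for local identifiers, and a URI template for the provider's own resolver.
PER_NAMESPACE = {
    "exdb": {
        "pattern": re.compile(r"^[A-Z]\d{5}$"),
        "template": "https://db.example.org/entry/{id}",
    },
}


def resolve_pattern1(identifier):
    """Look up the redirect target registered for one identifier."""
    return PER_IDENTIFIER[identifier]["target"]


def resolve_pattern2(compact_id):
    """Apply a namespace rule to a 'namespace:local_id' string."""
    namespace, local_id = compact_id.split(":", 1)
    rule = PER_NAMESPACE[namespace]
    if not rule["pattern"].match(local_id):
        raise ValueError(f"{local_id!r} is not a valid {namespace} identifier")
    return rule["template"].format(id=local_id)


print(resolve_pattern1("10.9999/example.1"))
print(resolve_pattern2("exdb:A12345"))

# When the provider changes hosts, Pattern 2 needs a single edit to the rule,
# and every identifier in the namespace follows.
PER_NAMESPACE["exdb"]["template"] = "https://new.example.org/records/{id}"
print(resolve_pattern2("exdb:A12345"))
```

Under Pattern 1 the same move would mean updating each registered target, but in exchange the registry can hold metadata for every individual identifier.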


4. EXEMPLAR PERSISTENT, RESOLVABLE IDENTIFIERS 


A variety of services have been developed and made available in support of persistent identifiers, 
facilitating their consistent use/reuse, as well as global collaborative efforts to converge upon best practices 
in these areas. We highlight some exemplar services here, all of which comply with the recommendations 
in [13]. 


4.1 Digital Object Identifiers 


A very well-known example of the Pattern 1 approach (above) is the Digital Object Identifier (DOI)
system [14]. DOIs are widely used in the publishing industry to identify documents, and also serve as data set
identifiers for data citation and interoperability, among other uses. They provide a canonical example
of how Pattern 1 should be implemented.


The International DOI Foundation② maintains the DOI system, supported by its various domain-centric
Registration Agencies (RAs). The most important RA for our purposes is DataCite③, a member-supported
organization that registers data sets [15]. DataCite membership gives an organization the right to mint a
certain number of DOIs per year, at various subscription levels. Over 16 million unique DOIs have been
registered with DataCite. DOI resolution is free.


DOIs consist of a DOI name which is resolved at https://doi.org, with the full URI formulated according
to the pattern https://doi.org/<DOI_name>. DOI names in turn consist of a prefix and a suffix, separated by
a forward slash. The prefix is a code indicating the registrant who issues the DOI, e.g., 10.7910 for Harvard
University Dataverse and 10.5061 for Dryad Digital Repository. The suffix is the identifier, in any form, assigned
by the registrant.
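
The prefix/suffix structure just described can be made concrete with a short Python sketch. The function names are ours and the validation is deliberately minimal (it only checks for a non-empty prefix and suffix around the first forward slash); it is not a full DOI syntax checker.

```python
DOI_RESOLVER = "https://doi.org/"


def split_doi_name(doi_name):
    """Split a DOI name into (prefix, suffix) at the first forward slash."""
    prefix, sep, suffix = doi_name.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a DOI name: {doi_name!r}")
    return prefix, suffix


def doi_uri(doi_name):
    """Build the resolvable URI https://doi.org/<DOI_name>."""
    split_doi_name(doi_name)  # shape check only
    return DOI_RESOLVER + doi_name


prefix, suffix = split_doi_name("10.1038/sdata.2016.18")
print(prefix, suffix)
print(doi_uri("10.1038/sdata.2016.18"))
```

Splitting at the first slash only is intentional: the suffix is free-form and may itself contain further slashes.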


② https://doi.org.
③ https://datacite.org.
Thus an http GET request to:


https://doi.org/10.5061/dryad.k5r2bb0


redirects to the Dryad entry at this URI:


https://datadryad.org/resource/doi:10.5061/dryad.k5r2bb0


which returns metadata about the entry: 


Data from: The Muensterelloidea: phylogeny and character evolution of Mesozoic stem octopods 
Authors: Fuchs D, Iba Y, Heyng A, Iijima M, Klug C, Larson NL, Schweigert G
Date Published: June 28, 2019 


including the resolution endpoint for its data package, here: 


Data package: https://datadryad.org/bitstream/handle/10255/dryad.192323/Appendix%202_Data%20matrix.nex?sequence=1


Metadata, including the data package URI, will have been registered with DataCite at the time the
identifier was assigned. Full resolution to the actual data is a multi-step process. First, the client sends an
http GET request to the doi.org resolver for the DOI of interest. Then doi.org sends a redirect message to
the client with the address of the data set’s “landing page”, which holds the metadata, including the
resolution endpoint(s) for its data. Finally, the client retrieves the data from one of those endpoints.
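
The first hop of this process can be observed directly. The sketch below, written with Python's standard http.client module (which does not follow redirects on its own), sends a single GET request to the resolver and reports the status code and the Location header it answers with. The function name first_hop and the optional conn parameter are our own choices; the parameter exists only so that the function can be exercised without network access.

```python
import http.client
from urllib.parse import quote


def first_hop(doi_name, conn=None):
    """Send one GET to doi.org and report the redirect without following it."""
    if conn is None:
        conn = http.client.HTTPSConnection("doi.org", timeout=10)
    # Percent-encode unusual characters in the DOI name, but keep the slashes.
    conn.request("GET", "/" + quote(doi_name, safe="/"))
    response = conn.getresponse()
    # The Location header carries the address of the landing page.
    return response.status, response.getheader("Location")


# Usage (needs network access), e.g. with the Dryad DOI from the text:
# status, landing_page = first_hop("10.5061/dryad.k5r2bb0")
```

A client that stops here can inspect the landing page and its metadata before deciding whether, and from where, to fetch the data themselves.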


Why is this multi-step process required? The redirect from the resolver separates the endpoint of the data from their
persistent identifier, providing isolation from infrastructure changes at the provider side. The landing page, as a second
level of indirection, allows decisions to be made about how to handle the data set prior to access. For example, we may want
to ensure that we have the correct data set by scanning the metadata; we may not want to directly download 
a large data set before we ascertain it is the correct one (via the metadata); and if very large, we may wish 
to consider transit costs, and transport speed, before we decide how to access it. If the data exist at multiple 
mirror sites, we might want to consider proximity. These decisions can be made in real time by humans 
working at a browser, or on a command line; or they can be made algorithmically by “content negotiation” 
between services [16,17]. 


Metadata at the DataCite service is provided in both human- and machine-readable forms. Machine-
readable metadata is provided both in DataCite XML and in the schema.org [18] vocabulary, serialized as
JSON-LD [19]. Future work may also enable incorporation of terms from domain-specific schema.org
extensions such as bioschemas④ [20].
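
A client typically selects between these serializations through the HTTP Accept header. The following Python sketch builds, but does not send, such a request. The two media type strings are our understanding of what is offered for DataCite DOIs and should be treated as assumptions to be checked against the current content negotiation documentation; the function name metadata_request is likewise ours.

```python
import urllib.request

# Assumed media types for DataCite DOIs (verify before relying on them).
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"
DATACITE_XML = "application/vnd.datacite.datacite+xml"


def metadata_request(doi_name, media_type=SCHEMA_ORG_JSONLD):
    """Build a GET request that asks the resolver for machine-readable metadata."""
    return urllib.request.Request(
        "https://doi.org/" + doi_name,
        headers={"Accept": media_type},
        method="GET",
    )


request = metadata_request("10.5061/dryad.k5r2bb0")
print(request.full_url, request.get_header("Accept"))

# Sending the request needs network access:
# with urllib.request.urlopen(request, timeout=10) as response:
#     metadata = response.read().decode("utf-8")
```

The same identifier URI thus serves both a human at a browser, who receives the landing page, and a program, which asks for a structured record instead.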


④ https://bioschemas.org.
4.2 Archival Resource Keys


Archival Resource Keys (ARKs) are another widely used persistent identifier, supported by the California
Digital Library [21], in collaboration with DuraSpace⑤. ARKs work similarly to DOIs, but are more permissive
in design. Over 500 registered organizations have created over 3.2 billion ARKs. There is no fee for
registering or resolving ARKs. Metadata is strongly recommended for publicly released ARKs, but is optional
and flexible while the resource is being planned, reviewed, tested, etc.


The ARK scheme is decentralized; any organization can run its own resolver, but all ARKs can be resolved
through the global Name-to-Thing (N2T) resolver⑥. N2T operates in two modes. First, it stores records for
millions of ARKs registered by the EZID service⑦ [22] and the Internet Archive. Second, in collaboration with
identifiers.org (below), it stores 3,500 rules for redirecting identifiers that it does not store individually.


ARKs have the syntax:
https://NMA/ark:/NAAN/Name[Qualifier]
where:


• NMA is the Name Mapping Authority;
• NAAN is the Name Assigning Authority Number (like a DOI prefix);
• Name is the local identifier issued by the NAAN; and
• [Qualifier] is an optional qualifier.


Name Mapping Authorities operate ARK resolvers. Name Assigning Authorities have NAANs allowing 
them to mint globally unique ARKs. 
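
The syntax above maps naturally onto a regular expression. The Python sketch below is a simplified reading of that syntax rather than a complete ARK parser: it assumes that the Name ends at the first "/" or "." and that whatever follows is the Qualifier, which real resolvers may treat more permissively. The second example URI, with NAAN 12345, is hypothetical.

```python
import re

# Simplifying assumption: Name ends at the first "/" or "."; the rest,
# if any, is taken to be the Qualifier.
ARK_URI = re.compile(
    r"^https?://(?P<nma>[^/]+)/ark:/(?P<naan>[^/]+)/(?P<name>[^/.]+)(?P<qualifier>.*)$"
)


def parse_ark(uri):
    """Split an ARK URI into NMA, NAAN, Name and Qualifier."""
    match = ARK_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ARK URI: {uri!r}")
    return match.groupdict()


# The Smithsonian specimen ARK from Table 1.
parts = parse_ark("http://n2t.net/ark:/65665/3c2e39526-e0c3-41ae-be4f-07558a9458eb")
print(parts["nma"], parts["naan"], parts["name"], repr(parts["qualifier"]))

# A hypothetical ARK with a qualifier.
print(parse_ark("https://n2t.net/ark:/12345/x7/part.pdf"))
```

Here n2t.net plays the role of the Name Mapping Authority and 65665 is the NAAN of the organization that minted the name.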


4.3 Identifiers.org and CURIEs 


The identifiers.org system [23,24,25] has been providing globally resolvable http URI-based identifiers 
to the scientific community for over a decade. The core system consists of a resolver and a namespace 
mapper, following Pattern 2 (Section 3). It registers a unique namespace, a namespace prefix, an identifier 
regex and a local identifier resolver URI, to provide globally unique persistent identifiers. It records metadata 
only for the identifier namespace, not for individual identifiers, relying upon the data provider to expose 
the appropriate metadata within the resolved landing page. 


Identifiers.org URIs follow the syntax:
https://Resolver/[Provider/]Namespace:Identifier
where:


• Resolver is the resolution host, currently either identifiers.org or n2t.net;
• Provider is an optional term indicating specific providers for a Namespace, e.g., “RCSB”, “PDBe”,
  or “PDBj” for the Protein Data Bank’s various providers;
• Namespace is a registered prefix assigned to an identifier issuing authority; and
• Identifier is the locally assigned identifier.


⑤ https://ARKsInTheOpen.org.
⑥ https://N2T.net/.
⑦ https://ezid.cdlib.org.


Taken together, a namespace prefix and colon followed by a local identifier constitute a CURIE, or 
Compact Identifier [13,23]. 
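
To make the syntax concrete, the following Python sketch takes such a URI apart and recovers the CURIE. It is an illustration only: the function name is ours, the second example URI (with a provider segment) is hypothetical, and no check is made against the registered identifier pattern for the namespace.

```python
from urllib.parse import urlsplit


def parse_compact_uri(uri):
    """Split https://Resolver/[Provider/]Namespace:Identifier into its parts."""
    parts = urlsplit(uri)
    path = parts.path.lstrip("/")
    # Split at the first colon: local identifiers may contain colons themselves.
    head, sep, identifier = path.partition(":")
    if not sep or not head or not identifier:
        raise ValueError(f"no Namespace:Identifier in {uri!r}")
    # An optional provider precedes the namespace, separated by a slash.
    provider, _, namespace = head.rpartition("/")
    return {
        "resolver": parts.netloc,
        "provider": provider or None,
        "namespace": namespace,
        "identifier": identifier,
        "curie": f"{namespace}:{identifier}",
    }


# A UniProt identifier for Calmodulin-1, as in Table 1.
print(parse_compact_uri("https://identifiers.org/uniprot:P0DP23"))

# Hypothetical example with a provider segment.
print(parse_compact_uri("https://identifiers.org/RCSB/pdb:2gc4"))
```

The CURIE is independent of the resolver host, so the same compact identifier can be presented to either identifiers.org or n2t.net.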


Identifiers.org records all known resource-specific access URIs for a data set or data record, including
alternative access URIs. This information allows numerous equivalent URIs to be mapped to a single
canonical URI string. This mapping, retrievable as a data set in its own right, is integral to platforms like
EMBL-EBI RDF and Open PHACTS [26], where identifiers.org URIs act as semantic glue in distributed
queries.


4.4 PURLs 


Persistent Uniform Resource Locators (PURLs) generically are http URIs whose services redirect GET
requests to another location-based URI, with the added promise of long-term persistence. All identifier types
discussed above meet this definition. However, in common usage, a PURL is an identifier minted and
resolved at https://purl.org, hosted since 2016 by the Internet Archive⑧, after being transferred from OCLC,
where the service started in 1995. That the ownership was transferred to the Internet Archive highlights that
particular care was taken to preserve and continue operation of this identifier infrastructure and its
registrations. A separate PURL registry exists at https://purl.obolibrary.org, hosted by the OBO Foundry.
OBO Library PURLs are widely used by the OBO ontologies community to persistently reference their
information artifacts, for which they are well suited⑨.


Users of purl.org register to gain control of a sub-path, e.g., http://purl.org/dc/ and then assign and modify 
the redirection targets for individual entries like http://purl.org/dc/terms/. The ability to update the PURL 
target is crucial for persistence, for instance if ownership of a service changes, or the server infrastructure 
changes URL layout. In this sense a PURL is a counter-measure to unintended violations of “Cool URIs 
don’t change” [27]. 


⑧ https://archive.org.
⑨ However, we do not recommend PURLs in general as primary identifiers for persistently archived FAIR data.


5. DATA CITATION


The Joint Declaration of Data Citation Principles (JDDCP) provides a set of guiding principles for
publishing and citing research data. Many implementation guidelines developed for JDDCP [15, 16, 23,
28, 29] provide rich foundational practices for FAIR, including globally unique persistent identifiers, indirect
resolution to a landing page or service, and accessibility of human- and machine-readable metadata. In our
view, it is not possible to reliably reuse data without robust links to provenance documentation. Ultimately,
literature describing how data were obtained provides the rationale for accepting them as reliable.
Conversely, no assertion in the scientific literature is ultimately credible unless backed by reliable data.
Good data sharing, archiving and citation practices followed by authors, publishers, repositories, funders
and institutions, along with good identifier and metadata practices, and appropriate credit incentives
[30,31], will promote creation of normative FAIR data integrally associated with the literature.


6. IDENTIFIERS PLUS METADATA 


Sensible strategies that work across projects, domains and infrastructures permit realization of an open 
landscape where science can operate reliably with massive amounts of data. Large scale “data commons” 
projects such as the European Open Science Cloud, the NCI Genomic Data Commons, and the NHLBI 
Data Stage are initial steps in this direction. Investigators in such programs understand that frameworks for 
interoperating across commons in a “data ecosystem” [5] of FAIR research objects require not only persistent 
resolvable identifiers, but well-documented APIs for associated human- and machine-readable metadata. 
These metadata should optimally include not only provenance information but descriptive “guide” metadata 
to enable discoverability. A large, active community is engaged in determining and establishing best
practices around persistent identifiers, metadata standards and exposure mechanisms, including many
“interest” and “working” groups operating under the auspices of the Research Data
Alliance.


Properly managed persistent identifiers with their associated metadata are the primary access point to 
enable Findability in the FAIR data ecosystem, and to provide a basis for all the other FAIR attributes. 